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Abstract 

A one-sided Jacobi hyperbolic singular value decomposition (HSVD) algorithm, 
using a massively parallel graphics processing unit (GPU), is developed. The al- 
gorithm also serves as the final stage of solving a symmetric indefinite eigenvalue 
problem. Numerical testing demonstrates the gains in speed and accuracy over 
sequential and MPI-parallelized variants of similar Jacobi-type HSVD algorithms. 
Finally, possibilities of hybrid CPU-GPU parallelism are discussed. 
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1 Introduction 

In this paper a hyperbolic SVD algorithm (HSVD) for graphics processors (GPUs) is 
developed. To the best of our knowledge, this is the first one-sided Jacobi-type HSVD 
algorithm for full column rank matrices of arbitrary dimensions, using GPUs. 

The first paper that we know of, which describes a Jacobi-type computation of SVD 
on GPUs, was published by Zhang and Dou in [27]. According to the English summary 
of the paper, they compute the SVD by using the one-sided Jacobi algorithm on a matrix 
of order 512. The order of orthogonalization of pairs of columns is chosen by the parallel 
pivot strategy described in pQ. Later, Lahabar and Narayanan [9 have computed SVD 
on GPUs by bidiagonalization followed by a bidiagonal QR algorithm (in single precision 
arithmetic). Finally, Sachdev, Vanjani and Hall in |15| have presented an algorithm for 
Takagi SVD (for complex symmetric matrices) by a symmetrized version of the two-sided 
Kogbetliantz algorithm (in single precision arithmetic). 

Given a Hermitian indefinite matrix M. factorized as M = GJG*, with G of full 
column rank and J a diagonal matrix holding the inertia of M, we aim at computing the 
HSVD of the factor G, 

£" 



G = U 







V*, (1.1) 



where U is unitary, V is J-unitary (i.e., V*JV = J), and £ is a diagonal matrix with 
positive diagonal entries. 

If G is not of the full column rank, £ in (Jl.ip , is not diagonal [2"61 Remark 5], and the 
non-diagonal part reflects the loss of rank of GJG* compared with the rank of G. In the 
(very important) special case, when J = I, the HSVD is the ordinary SVD. 
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If the HSVD of a factor G is known, then the eigendecomposition of M is readily 
available as 

V*JV [S 0] 17* = [/(£ 2 J)[T 



M = G7G* = f/ 



i.e., the matrix U of left singular vectors of G is also the eigenvector matrix of M, while 
the eigenvalues of M are diagonal elements of E 2 J. 

In many applications, M is already given implicitly, by its rectangular factor G, and 
the signature matrix J. That being the case, G should be shortened either by the QR 
factorization (if G is tall) or by the hyperbolic QR factorization (see [18J) of G* (if G* 
is tall), for the efficiency of the HSVD algorithm. In both cases, the HSVD of a square 
matrix will be computed, either of R (in the case of the QR factorization) , or of R* (in 
the case of hyperbolic QR factorization). To obtain matrices of singular vectors, after the 
completion of the HSVD, either U (in the case of ordinary QR factorization), or V (in 
the case of hyperbolic QR factorization) should be prcmultiplied by Q. 

On the other hand, if M is given explicitly, G and J are usually computed by the 
Hcrmitian indefinite factorization with postprocessing (see [22 )i which always produces 
G of the full column rank. Note that G obtained by the Hermitian indefinite factorization 
can be rectangular with more rows than columns, and should be shortened by the QR 
factorization. Therefore, it is sufficient to efficiently compute HVSD of a square matrix. 

The HSVD also serves as the second stage in solving a Hermitian indefinite eigenprob- 
lem. Our method of choice for such a computation is the one-sided hyperbolic Jacobi 
algorithm [25J, which has been much studied in the recent years, due to its high relative 
accuracy |23J and various possibilities of the efficient blocking and parallelization in the 
scope of the conventional CPU computing [211 |2"U] . 

The efficient GPU solution (especially, one with minimal CPU intervention) for the 
first stage (i.e., Hermitian indefinite factorization) remains an open problem. 

We will show that the Jacobi algorithm is elegantly parallelizable on the modern GPUs, 
utilizing them as a primary target for the algorithm execution. The CPU acts only as 
the ancillary unit: for driving and synchronizing the GPU computation, and for data 
initialization on the GPU. 

The numerical experiments demonstrate the benefits of such an approach, compared 
to the fastest existing sequential and multi-process parallel Jacobi implementations. Sig- 
nificant speedup vs. the sequential, and moderate speedup vs. the CPU-parallel algorithm 
with 4 processes are obtained. 

Finally, the true strength of our approach is discussed, through the possibilities of 
combining the GPU parallelism with the traditional one, for the very large problems. 

The rest of the paper is organized as follows. In Section [5] the essentials of the Jacobi 
HSVD and basic tools for its parallelization are presented. Section [3] deals with the 
properties and constraints of a class of GPU computing platforms our algorithm, named 
GPUJACH1, is designed for. The algorithm itself is detailed in Section @] and it is 
compared, by numerical testing, with a sequential and a CPU-parallel implementation of 
the Jacobi HSVD in Section [3] We conclude with a discussion of the possible applications 
of GPUJACH1 in the context of hybrid CPU-GPU parallelism, and with notes on future 
work, in Section [6] The Appendix contains a proof of convergence for the modulus pivot 
strategy. 



2 The essentials of the Jacobi HSVD 

The hyperbolic one-sided Jacobi algorithm implicitly diagonalizes a definite pair (A, J), 
where A := G*G, by J-unitary congruences [23 . "Implicitly" in this context means that 
the columns of G are orthogonalized. 

The idea of the one-sided algorithm is to multiply the columns of G^ ) :— G from the 
right-hand side by J-unitary matrices V/T* , k — 1, 2, . . . , s, to obtain the matrix G^ , with 
numerically orthogonal (not orthonormal) columns. If [V _ *](°) is defined as [V~*](°) = I, 



2 



then 



G (k) = G (k-i)y-*^ 



[v- 



-*](k) _ j^-*](fc-l) Vr * 



v,: 



fc = l,2, 



(2.1) 



If G^ has sufficiently orthogonal columns, [y - *]^ serves as a quite good approximation 
of V~*. Now, the matrix V of right (hyperbolic) singular vectors is easily obtainable from 
V~*, since V is a J-unitary matrix, i.e., V = JV~*J. 

The hyperbolic singular values Oi are norms of the final G^ , while the approximate 
matrix U of left singular vectors is computed by column scaling of G^ by diagonal matrix 
E- 1 =diag(ar\...,a- 1 ). 

If the algorithm is used only for the eigenvalue computation, V * need not be accu- 
mulated, since U keeps the eigenvectors of M — GJG* , while the eigenvalues are squares 
of singular values multiplied by a correct sign from J, i.e., = ofju- 

Just for simplicity, from now on, we restrict ourselves to the real case, i.e., to symmetric 
matrices instead of Hermitian. To emphasize this, we will use symbol T (instead of *) for 
transposition. 

The J-unitary matrices V^ T from (|2.ip are usually very simple - they are plane 
rotations (trigonometric and hyperbolic) . Such a rotation orthogonalizes a pair of columns 
of G. Their non-identity parts can be represented as 



Vr 



Vh 1 ■= 



cos tp sin tp 

— sin tp cos tp 

cosh tp sinh tp 

sinh tp cosh tp 



1 tan tp 
— tan ip 1 

1 tanh tp 
tanh tp 1 



cos tp 
cos tp 

cosh tp 
cosh tp 



(2.2) 



Note that tp is not needed explicitly, the tangent would suffice to construct a rotation. In 
the Hermitian case, only one additional angle a is needed, see for example |17j . 
The column orthogonalization starts by forming a 2 x 2 pivot block A p , 



[g* gj] T [gi g 3 ] 



where gi and gj denote the i-th and j-th column of G (we may assume i < j) . 

Optionally, a further processing of the column pair may be avoided if the columns arc 
relatively orthogonal (up to the machine precision e), i.e., 



lay | < ey/auajj, (2.3) 

where e is the smallest floating-point number such that £1(1 + e) > 1. 

First, the adequate rotation (trigonometric Vf T if i-th and j-th diagonal element of 
J agree, or hyperbolic V^ T otherwise) to annihilate the element aij is computed. Then, 
this transformation (|2.2f) is applied from the right to the columns [gi gj], giving 



g'i '■= fa - t&n tpgj) cos tp, 
in the trigonometric case, and 

g'i '■= (9t + tanhtpg.) cosh^, 



(.g + tanyjgj cost/? 



(<7j + tanh tpg t ) cosh tp 



(2.4) 



(2.5) 



in the hyperbolic case. The formulae (|2.4p - (|2.5p . written pointwise, represent a basic tool 
for updating the columns of G. The columns of V~ T are updated in the same fashion, 
with an appropriate change of notation, see (|2.ip . 

The parts of the formulas (|2.4p - (|2.5p written in parentheses can be performed by a 
single fused multiply-add (FMA) operation, where available, with only one rounding of 
the result, thus improving the speed, while exhibiting smaller roundoff errors. 

All Jacobi-type algorithms iterate until convergence criteria are met. Usually, these 
iterations are organized in sweeps (sometimes called cycles). For a symmetric matrix A 
of order n, a sweep is a collection of n(n — l)/2 annihilations of different elements in the 
strict upper triangle of A. In other words, in a sweep, pairs of columns of G (pivot pairs) 
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are orthogonalized as described previously. The order of pivot pairs is chosen according to 
a pivot strategy. In sequential use, widespread pivot strategies are row and column cyclic, 
since they use cache memory well (depending on layout of arrays in memory). For both 
strategies, convergence |25| and asymptotic quadratic convergence [3J have been proved. 
If the pivot pairs are orthogonalized in a cyclic manner more than once (but a prescribed 
number of times) in a 'sweep', this 'sweep' is usually called a quasi-sweep and such a pivot 
strategy is called quasi-cyclic. 

A simple convergence criterion is that no rotations occurred in a (quasi-) sweep, due 
to condition (|2.3p , which guarantees relative accuracy [53]. The process could also be 
stopped if the computed tangents are all below a predefined threshold T s :— \/e/2, i.e., 
when the quadratic convergence is detected. The second, threshold test is a replacement 
for relative orthogonality criterion (|2.3[) from [23] since we are trying to avoid an extra 
(quasi-)sweep. If we define 



hyp 

then 



— 1, in the trigonometric case, 
1, in the hyperbolic case, 



I tan2<^ if hyp = — 1, ^ I taxnp if hyp = — 1, 
1 tanh2(/3 if hyp = 1, 1 tanhtyS if hyp = 1. 

If the tangent is very small, i.e., \t\ <C 1, then 9 ~ 2t. The angle 9 is computed from the 
requirement that ~ 0, 



-an + hyp ■ cijj 

Therefore, for \t\ < T e , we have ay ~ t(—an + hyp ■ ay), and since 

■ hyp ■ I",,- , • (2-6) 

when \t\ < T e , the second term from the right-hand side in (|2.6p satisfies 

\hyp ■ ta^ | = \ta,ij\ < | • (| - an + hyp ■ a. ]:j \) < | • max{a,i, a j:j }. 

Therefore a' u w a u and a'^ « in (|2.6f) . When diag(JA) is sorted and the process is 
near the end, the off-diagonal norm of A. i.e., \\A — diag(A)||^, quadratically diminishes 
[3J, and so the off-diagonal elements. The tangents in the following (quasi-)sweep will also 
be quadratically smaller, and the diagonal updates in (|2.6p will be negligible. 

To summarize, the orthogonality convergence check is a safe fallback, with guaranteed 
numerical orthogonality of the eigenvectors, but in our testing experience, the threshold 
criterion has always happened before and terminated the process. 

The Jacobi HSVD algorithm, when computing the eigendecomposition, can be imple- 
mented using just one 2-dimensional array G, initialized to the factor G. At the end of the 
process, G holds the computed UY>. The column norms of UT, are the hyperbolic singular 
values, and normalized columns are the left singular vectors of G, i.e., the eigenvectors of 
M. If the full HSVD is desired, the sequence of applied Vf T and V^ T transformations, 
i.e., the matrix V~ T , needs to be accumulated in another 2-dimensional array V, starting 
with the identity matrix /. 

On a theoretical level, the main difference between a sequential and a parallel one-sided 
Jacobi algorithm is the choice of a pivot strategy. In any parallel algorithm, CPU or GPU 
based, a pivot strategy is chosen to simultaneously orthogonalize as many independent 
(non-overlapping) pivot pairs as possible. These pairs can be either two single columns, 
or two block-columns. Either way, this collection of independent pairs will be called a 
step. 

Our choice of parallel pivot strategy, for CPU and GPU algorithms alike, is a slightly 
modified modulus strategy [11] . If the order n of a matrix is even, n = 2q, then a quasi- 
sweep has exactly n steps, and all the elements in the upper triangle of A are annihilated 
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(some of them twice). For q independent tasks (e.g., CPU or GPU threads), exactly q 
pivot pairs are simultaneously orthogonalized in each step. 

In Fig. [T]the modified modulus pivot strategy, for n = 8 (q = 4), is illustrated from 
a point of view of the (implicit) matrix A. Each subfigure shows a single step, with 4 
pivot submatrices that can be independently processed by 4 tasks, either sequentially or 
simultaneously. After 8 steps a quasi-sweep is completed, and a new one will begin with 
the initial assignment of pivot columns to tasks reversed, as shown in the ninth subfigure. 




Figure 1: Modified modulus strategy for n — 8 and q = 4. Background colors denote 
elements annihilated in one step, and circled elements arc annihilated twice in a quasi- 
sweep. 

Our modified modulus pivot strategy differs from [TT] in the choice of a start step 
(ours is the antidiagonal) , and in annihilating exactly twice the elements of the diagonal 
(i, q + i), i = 1, . . . , q, for even n. This is a quasi-cyclic strategy, designed to speedup the 
convergence. The proof of convergence for this strategy is a work in progress. If we avoid 
double annihilation, our strategy is shift equivalent to the modulus strategy, for which 
the proof of convergence is detailed in Appendix. 

For more details of the modified modulus pivot strategy, including the efficient algo- 
rithm for step transitions, please see Section 2] and [2D]. 

3 The GPU computing platform 

GPU computing has already evolved to a mainstream technology. Although vendor- 
neutral standards (like OpenCL [§]) have emerged, we implemented GPUJACH1 for the 
NVIDIA CUDA architecture [13], due to its maturity and close ties to the underlying hard- 
ware. The algorithm is portable, more or less efficiently, to any similar single-instruction 
multiple-threads platform (SIMT), provided IEEE 754-2008 |7| double-precision floating- 
point arithmetic is available and the basic computational concepts are found in (or could 
be mapped to) the other architecture. 

In short, the CUDA architecture provides to a programmer a thin layer of abstrac- 
tions, programming tools and interfaces on top of the recent NVIDIA graphics processing 
hardware. The hardware itself is seen as a massively parallel device whose threads execute 
the same instruction sequence over (possibly) different data, stored on the device. The 
execution is initiated by the CPU (host), and a fast host-to-device and device-to-host 
data transfer is available (but not as fast as device memory access). 

This SIMT paradigm entails a careful rethinking of even the basic algorithms. Any 
branching causes a significant slowdown, since threads on divergent branches proceed se- 
quentially. The Jacobi algorithm, however, fits this paradigm perfectly, because the same 
orthogonalization primitives are applied to different pairs of matrix columns concurrently. 

A subroutine the device threads are executing is called a kernel. A kernel can be 
written in a high-level programming language (e.g., C), or using the assembly-like, low- 
level, but general-purpose instruction set of the PTX [14 virtual machine. 
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The device threads are grouped into so-called warps, for memory access optimization, 
execution scheduling and finest-grain synchronization. A warp consists of 32 threads in 
the current CUDA hardware, with two half-warps of 16 threads each. 

From a high-level perspective, blocks of threads are established. All blocks are of the 
same size, and may be seen as one, two or three dimensional arrays of threads. Threads 
are indexed and may be synchronized only within their blocks, i.e., different blocks are 
concurrent and mutually independent. For each launch of a kernel, the size of a block and 
the number of blocks are set, tailored to the nature of the algorithm and the data. Such 
a one or two dimensional array of blocks is called a grid. 

The grid for GPUJACH1 kernels, working on a factor G with even number of columns 
r, is shown in Fig. [5J 
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Figure 2: The CUDA grid for GPUJACH1. 

Memory management is the major obstacle in GPU programming, as many distinct 
memory spaces of varying speed, size and accessibility are involved. 

Input and output data is stored in the global memory, a large but slow portion of the 
GPU memory It is accessible to the CPU and all GPU threads. Improper (uncoalesced) 
access tremendously degrades performance |13| . thus the carefully aligned addressing of 
successive locations by threads with successive indices is required. 

The shared memory, on the other hand, is a small but fast device memory, allocated 
per block, and ideal for data exchange between threads of the same block. 

The constant memory, small and cached, holds kernel parameters not changing be- 
tween launches (e.g., device pointers to global memory arrays in GPUJACH1). 

Arithmetic is done in per-thread registers. GPUJACH1 uses 64-bit floating-point and 
32-bit integer arithmetic. FMA instruction (e.g., dot-products, column updates) and 
24-bit integer multiplication (e.g., address calculations) are chosen, if possible. 

GPUJACH1 targets NVIDIA GT200 graphics processor series, common in the time 
of writing. The threads are executed by 30 8-core multiprocessors (SMs). Each SM has 
a 64 kB register file, one double-precision arithmetic unit and 16 kB of shared memory. 
More than one block can reside on an SM, as the resources (register usage per thread and 
shared memory allocation per block) and other constraints [13] allow. GPUJACH1 needs 
only 512 B of shared memory per block, but 48 32-bit registers per thread, thus at most 
5 blocks can reside on an SM at a given time. 

Two distinctive features of the GT200 series are that no cache is available for the 
global memory, and that single-precision arithmetic deviates in the guaranteed accuracy 
too much from the standard [7] to be useful for GPUJACH1. Both issues are addressed 
by the more advanced architectures (e.g., NVIDIA Fermi [13 ). 

4 The GPUJACH1 algorithm 

The GPUJACH1 algorithm is an efficient parallel realization of the one-sided pointwise 
hyperbolic Jacobi SVD. GPUJACH1 provides all parts of the Jacobi process executed on 
the GPU as much as possible. 

GPUJACH1 consists of, essentially, two parts: 

• host routines (auxiliary), which are executed on the CPU, and 
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• device routines (computational), which are executed on the GPU. 
Host routines serve mainly to call device routines in a synchronized manner and collect 
the resulting information, where needed. 

A sketch of the main driver routine is given in Alg. 14.11 and then its essential parts 
are described in details. In the following, by the subscript H host variables, and by D 
device variables are denoted. Arrays are assumed to be in Fortran (column-major) order, 
but indices are zero-based (as in C). Array slices are written in Matlab fashion. 



Algorithm 4.1: The host driver routine 
Description: This is the main program, executed on the CPU. 
Input: n x r factor G, and the number p of positive signs in J. 

Output: hyperbolic singular values E and the matrices of left (U) and hyperbolic right 
{V T ) singular vectors. 

Host_Driver_Routine; 
begin 

Gd -5— Gh', Vd <— Ir\ //optional 

Pr e c omput e _Dat a; 

Sort_Diagonal; 

for sweep — to MaxSweep — 1 do 
Reset _Convergence; 
for step = to r — 1 do 

| Jacobi_Step; 
end for 

Check_Convergence; 
Sort_Diagonal; 
end for 

Gh^Gd; Vh4-Vd; D h 4- D D ; //optional 
end 



GPUJACH1 holds the n x r factor G in arrays Gh,d (we write Gh,d as a shorthand 
for Gh and Gd , similarly for V) . We may assume the factor to be square (r x r) , as left 
by preprocessing (see Section [TJ . The factor could also be preloaded on the GPU by a 
previous processing, and not needed on the CPU afterward, so Gh is optional. The factor 
must be of the full column rank, r needs to be even and not greater than n, and n, for 
performance reasons, should be a multiple of the warp size. This is not a loss of generality, 
since the initial G could easily be bordered, as in (|4.ip . to satisfy these constraints 



> £■ WarpSize. (4.1) 



If the matrix V~ T is needed, it is accumulated in the array Vd, starting from the 
identity I r , and is optionally transferred to Vh at the end. Arrays Vh,d, if used, must 
have the same number of columns as Gh,d, and the same restrictions (and appropriate 
bordering) apply. 

In principle, the execution begins with all the data residing in the main (CPU) memory, 
and should be copied to the global memory of the GPU. 

For efficiency, to access global memory data as few times as possible, and to reuse 
results of the computation while still in registers, the diagonal of the implicit pivot matrix 
A is kept and updated in each Jacobi step. While performance gains are here obvious, 
yet additional speedup can be obtained by a special diagonal sorting, described in [5T] 
for block versions of the Jacobi algorithm. Keeping the diagonal, i.e., the eigenvalues, 
sorted ensures the quadratic convergence of the Jacobi algorithm [3J [TH]- To facilitate the 
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sorting, the diagonal entries d, the composition of the sorting permutations p, and the 
signature J is packed into vectors Dh,d, as shown in Fig. [3] 
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Figure 3: Dh.d ~ packing of diagonal, permutation, and sign r-vectors in one. 

The separation of Dh,d to p entries with positive and r — p entries with negative 
sign elements of J is always maintained. Do is precomputed in Alg. 14.21 and sorted in 
Sort_Diagonal routine. The sorting key of a diagonal package is d. The first part of the 
diagonal (the one with positive signs in J) is sorted decreasingly, and the last part (with 
negative signs in J) is sorted increasingly, with respect to that key. We chose a GPU 
implementation of the merge-sort, thrust : : sort, from the Thrust library [6], to serve as 
Sort_Diagonal. Note that, at the end of the process, Do holds the sorted and squared 
hyperbolic singular values d and the final permutation p. 



Algorithm 4.2: Precompute_Data routine 

Description: Routine computes D D = (d = diag(G^G D ) , p = id r , j — diag(J p , — I r _ p )), 
and initializes the stepper vector So- 

Precompute_Data (Device_Code (block fc)); 
begin 

= (2*,2* + l); 
di=gTgi; d j =gfg :j ; 
pi = i; pj = j; 

if i < p then ji — 1 else ji — —1; if j < p then jj — 1 else jj — — 1; 
Initialize_Stepper; 
end 



Diagonal packing from Fig. [3] might seem unnecessary, when only diagonal entries and 
the permutation are strictly needed, and could be arranged into two simple, separate vec- 
tors of doubles and integers, respectively. Diagonal entries could also carry the respective 
signs from J, to simplify sorting. This approach is certainly possible, but it shows in the 
following section to be slower, due to slightly more complex Jacobi_Step routine, than 
the presented solution. Moreover, packed j incurs no extra overhead over packing solely d 
and p together - the place j occupies in the package must exist because of data alignment 
constraints of the GPU |13J . 

The CUDA grid for GPUJACH1 kernels (Fig. [2J has exactly b = r/2 blocks. Each 
block is assigned its own pair of columns of Go- In each Jacobi step these pivot pairs are 
mapped to blocks according to the modified modulus pivot strategy, implemented as an 
integer vector 

So = (ip,jp, i_blk,j_blk) 

of length 4x6, and a device routine, called Update_Stepper, which updates So at the 
end of a step, independently by each block (see Algs . 1431 and 14 . 4| ) . 

In other words, each block orthogonalizes its pivot pair (p(i_blk k ), p(j_blk k )). In a 
block, each warp is dedicated to access a single column only. This design makes the 
device code almost completely uniform for all threads, thus avoiding branching as much 
as possible. 

In the Jacobi_Step (Alg. 14. 5p columns of Go and Vo are indexed by the current 
permutation p, and no physical swapping of columns ever takes place. Therefore, at the 
end of the process, Go holds ITS, and V D holds V~ T ', in the original column order. To 
match the computed (and permuted) singular values to the singular vectors, Dh.d need 
to be permuted by the inverse of the final permutation p. 



8 



Algorithm 4.3: Initialize_Stepper routine 



Initialize_Stepper (Device_Code (block k))(r, Sd = (ip, jp, i blk, j_blk)); 

Description: Initialization of the modified modulus strategy stepper vector Sd to 

antidiagonal - k •<->■ (k, r — k — 1). Variables ip and jp are auxiliary, while 
i_ blk k and j_ blk k contain the first and the second column index 
(non-permuted) of a pivot pair for block k. 

begin 

I w k = fc ; blk k = l Pk\ 3Pk = r-k- !; 1_ blk k = iPk-> 

end 



Algorithm 4.4: Update _ Stepper routine 

Update_Stepper (Device_Code (block k))(r, Sd = {ip, jp, i_blk,j_blk)); 
Description: Block k determines the indices of the next pivot pair (i_blk k ,j_blk k ). 

begin 

if Ofc +JPk) >r-l then 

m = Wk + !; 
if = jp fc then 

I % V k = l Vk~ r l' 1 \ 3Pk = l Pk\ 
end if 

else 

I jPk = jp*; + 1 ; j_ hl K = JPfc ; 

end if 
end 



Algorithm 4.5: Jacobi_Step - the main computational routine 
Description: This is the main GPU computational routine. 

Jacobi_Step (Device_Code (block k))\ 
begin 

(i,j) = (S D (k).i_blk,S D (k).]_blk); Hi <j 
an = di\ djj — dj\ // already computed 
<kj = 9p(i)9 p (j) ; // dot-product 

//Indexing by permutation; no physical swapping of Gd columns; 
if g p (i) and g p (j) relatively orthogonal (up to given e) then goto End; 
if ji = jj then set hyp = — 1 else if ji = — jj then set hyp — 1; 
if hyp = — 1 then compute t from an, djj, ay as tan<£ else if hyp = 1 then as 
tanh tp; 

if hyp — —1 then compute c from t as cosip else if hyp = 1 then as coship; 
g' P (i) = (g P (i) + h VP ■ ^pQ))^ 9p(j) = (H>» + 9 P (j))c\ //FMA, seal 
//While updating G D columns, compute new d\ and cL; 

d 'i = {g' P (i)) T g' P (i)\ d 'i = {g' P (j)) T g' P (j)\ 

v 'p(i) = i v p(i)+hyp-tv p{j) )c; v' pU) = (tv p(l) +v p(j) )c; //FMA, seal 
Update_Convergence ; // as in J4. 2D 
End: Update_Stepper; 
end 
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The rest of Jacobi_Step fAlg. !4.5|) is more or less standard. The final question and the 
crucial efficiency issue is how dot products and column updates (daxpy-like operations) 
are performed. We shall describe only the dot product computation, since updating the 
columns is done in a similar fashion. 

For a dot product, each warp "grabs" WarpSize — 32 successive elements of its col- 
umn vector, one element per each thread. The threads then exchange these values via 
shared memory and update (FMA) their local partial sums. Finally, the partial sums 
are reduced in the shared memory, as shown in Fig. [5] This reduction at warp level is 
free of synchronization [13_, and needs to keep all the threads in a warp alive for further 
processing. 




Figure 4: Per-warp reduction (via shared memory) of the partial sums to the final dot- 
product. After each reduction stage "dashed" threads wait for the whole reduction to 
complete (but do not terminate). 



The convergence statistics is held in an integer vector Co of length b, which is updated 
independently for each block in a step. If no rotation was applied in block k, Co(fc) 
remains unchanged, otherwise it is updated according to the following formula 

= ( (11)a (4.2) 
m \(01) 2 bit_orC D (fc) if|i|<T E , 

where t is computed tangent in that block. Routine Reset_Convergence zeroes out Cd, 
and Check_Convergence returns bitwise-or reduction of Cjj. For that reduction on the 
GPU we used thrust : : reduce method of the Thrust library [6 . 

The reduction result tells whether there were no rotations in a quasi-sweep because 
the columns are orthogonal up to the machine precision (i.e., (00)2), or the quadratic 
convergence was detected (i.e., (01)2), or we must proceed further since no convergence 
criteria were fulfilled (i.e., (11)2)- Otherwise, we stop the process. 

It is worth noting that the convergence status could be accumulated by atomic bitwise- 
or global memory instruction |14| , but this approach was significantly slower. 



5 Numerical testing 

We found it intriguing to hand-code the main GPUJACH1 kernels (Jacobi_Step and 
Precompute_Data) in the PTX instruction set and wrap it up with the CUDA Driver 
API. The host part is written in C99 and C++. 

GPUJACH1 was compared to the sequential (with row cyclic pivot strategy) [S3] and 
MPI-parallel block-oriented j21] one-sided hyperbolic Jacobi algorithms. 

The parallel block-oriented algorithm is a blocked CPU version of the pointwise algo- 
rithm described here. In a step, instead of elements, (modified) modulus strategy reaches 
blocks. In the first step, inside each block Ap (formed of block columns Gp '■= [Gi Gj] as 
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Ap — Gp^Gp), each element in the strict upper triangle of Ap is annihilated only once. In 
all other steps, only the elements of the off-diagonal block GjG^ of are annihilated. 
In both cases the inner pivoting strategy is row-cyclic. To achieve both, matrix A p is fac- 
tored by the specially pivoted Cholesky factorization P T A P P = R T R, and then a HSVD 
of R is computed and the transformation matrix is applied from the right on G P . This 
blocking serves twofold purpose. The first is speeding up the process by keeping data 
(blocks or their parts) in the cache and utilizing BLAS 3 instead of BLAS 1 operations. 
The second is efficiency of the parallelization, since the number of parallel tasks is usually 
much smaller than the matrix order. 

This approach is the fastest known Jacobi-type parallel algorithm for computing the 
HSVD of the moderate-size matrices. Note that the pointwise algorithms are less amenable 
for application on a cluster, since they require heavy communication between parallel 
tasks, when they reside on different machines. 

Testing is performed on GTX280, GTX275, and TeslaC1060 graphics cards, and on 
two quad-core machines: Intel Xeon X5470 (sequential and 4-task parallel testing), and 
Intel Core i7 950 (8-task parallel testing, with hyperthreading) . For reference purposes, 
the speed benchmarks (measured in GFlops) are given in Table [1] 





Machine 
Rpeak 


Linpack 
Benchmark 
Rmax 


matrices of order 1000 x 1000 
Intel MKL dgemm Parallel block-oriented 
sequential parallel hyperbolic Jacobi 


Xeon X5470 
Core i7-950 


53.328 
48.960 


44.581 
45.972 


12.267 42.535 29.951 
12.315 39.939 35.343 



Table 1: Various measures of speed for the testing machines. 

For testing purposes, symmetric indefinite matrices with random, uniformly distributed 
spectra in [—a, —a • 10~ 5 ) U (a • 10~ 5 , a], were generated by a modified LAPACK dlagsy 
routine, rewritten, with all its support LAPACK and BLAS routines, in 80-bit extended 
precision (vs. 64-bit double). Then, matrices were factorized by the symmetric indefinite 
factorization with complete pivoting [2 (xsybpc), also in extended precision. The factor 
G is then downcasted to double. Graphically: 

dlarnd . upcast . xlagsy xsybpc downcast „ 
> A 6 4 > Ago > M 80 > Gso > <J64- 

The extended support came from GNU Fortran with real (kind=10) datatype. The 
parameter a varied depending on the matrix order n as follows in Table [2] 



n < 3168 < 6368 < 9568 < 10144 
a 20 30 40 50 



Table 2: Dimension of matrices n, and parameter a used in generation. 

Besides these cases, with approximately the same number of positive and negative 
signs in J, we also have experimented on matrices with different number of positive 
(negative) signs. The experiments confirm that the former cases are the slowest ones, 
while the definite cases (with zero, or all, signs positive) are the fastest. An example of 
this behavior is given in Table [3] 

This behavior is well-known [J . The reason lies in fact that trigonometric transforma- 
tions keep the trace of the matrix constant, while the hyperbolic transformations lower 
the trace. In exchange, the hyperbolic transformations enables the high relative accuracy 
of computed eigenvalues, and makes the two-norm of the spectral projector small when 
the eigenvalues have big relative gaps (see |24|L 
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number of positive signs 





1 


5 


10 


50 


100 


500 


1000 


2048 


time (with sorting) 
quasi-sweeps (with sorting) 


273 
13 


274 
13 


276 
13 


285 
14 


312 
15 


321 
15 


354 
16 


356 
16 


357 
16 


time (without sorting) 
quasi-sweeps (without sorting) 


306 
13 


307 
13 


308 
13 


321 
14 


333 
14 


349 
15 


360 
15 


372 
16 


368 
16 



Table 3: Speed of HSVD on GTX275 for matrices of order 4096 (with accumulation of 
V~ T ) and different number of positive signs in J. 

The spectrum A was saved in all cases, to be compared with the computed eigenvalues 
A'. The orders of matrices were 160 + 128fc, with < k < 78. Large values of k were tested 
only on TeslaC1060 (with 4 GB of memory), and skipped on GTX275 and GTX280, due 
to the cards' insufficient memory sizes of 896 MB and 1 GB, respectively. 

Relative errors in the computed eigenvalues (Fig. [5} are satisfactory in the context of 
high relative accuracy. Note that the average error is close to minimal. 




1440 2880 4320 5760 7200 8640 10080 

matrix size 



Figure 5: Relative errors in the computed eigenvalues, — ' ' . 

\w 

Distance from orthonormality for the computed eigenvectors U is calculated as d(U) = 
\\I - U T U\\ F (see [H] for details). It grows linearly form 1.11 • 10~ 14 to 7.55 • lO" 13 on 
full test spectrum. Row-distance d{U T ) differs from d(U), consistently with the bound 
d(U) = d(U T ) + 0(d 2 (U)), in the third digit, too little to be depicted. 

GPUJACH1 timing started when all data were in place, and stopped when the al- 
gorithm converged, but before the final collection of data. The similar holds for the 
sequential and MPI-parallel timings. In all cases, and for all algorithms, accumulation of 
V~ T is included in the timings. If V~ T is not needed (e.g., for the eigenvalue problem), 
the speedups are a bit higher - more than 4x in the 4-task case (for matrices of order 
n > 4800), and more than 1.25x in the 8-task task case (for matrices of order n > 8000). 
The effect of diagonal sorting on the speed of the algorithm is also illustrated in Table [31 
The similar speedup occurs on the other problem sizes. 

Timing result on common test cases (Fig. [5]) are perfectly consistent between GTX275 
and GTX280. TeslaC1060 is a bit slower because of the lower clock frequencies, but it is 
the only choice in its generation for the large problems. 

The speedup of GPUJACH1 vs. the sequential algorithm is shown in Fig. [7] The 
inevitable overhead of convergence checking (discussed in Section 2] and shown in Fig. I10[) 
makes GPUJACH1 inappropriate for small matrices, but the speedup quickly (at about 
n = 4800) stabilizes up to 17x in favor of GPUJACH1. 

The speedup of GPUJACH1 vs. the parallel block-oriented algorithm (Figs. [5] and [5J) 
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o GTX275 
6000 ■■ a GTX280 

□ Tesla C1060 

s 4800 - 
| 3600 ■■ 

2400 ■■ 

1200 ■■ 



1440 2880 4320 5760 7200 8640 10080 

matrix size 

Figure 6: Timing results of GPUJACH1 on square, full-rank matrices. 

p □ □ □ n n p 



o GTX275 
a GTX280 
p Tesla C1060 



1600 3200 4800 6400 8000 9600 

matrix size 

Figure 7: Speedup of GPUJACH1 vs. the sequential algorithm (with accumulation of 

is negligible (if none, i.e., CPU algorithm is faster), for matrices of order about 2000. 
One of the reasons is efficient caching the blocking algorithms were designed for, and 
GPUJACH1 had no caching opportunities whatsoever. When the CPU caches get too 
small for keeping the whole blocks without being frequently evicted, GPUJACH1 attains 
it peak speedup. The 3x speedup over the 4-task case, compared to 17x in the sequential 
case (if accumulation of V~ T is included) shows that the caching advantage of the block 
algorithm has dwindled. 

GPUJACH1 algorithms pays off when Jacobi_Step routine takes a predominant part 
in the overall computation time. The convergence check is the main issue here, while the 
GPU sorting time is almost negligible, except for the very small problems. All the other 
routines take well below 0.1% of the time. See Fig. [TO]for details. 

The profiling results of our algorithm are shown in Table |U The occupancy is not 
very high (5 blocks residing on a SM, instead of theoretical maximum of 8, because of the 
register file starvation), which indicates that the future effort should include simplifying 
the main Jacobi_Step kernel, maybe by breaking it down into a few lightweight kernels. 
However, the global memory throughput is satisfactory (71.91% of the theoretical limit 
of 102 GB/s on TeslaC1060). 

We have also adapted our algorithm to be able to run more (but still even) number of 
warps per block and to avoid the diagonal packing of (d, p,j) (see Fig.[3]and discussion in 
Section 2}. However, the device code became more complex (i.e., uses 53 registers and a 
couple of instructions more than GPUJACH1), and the timing results were disappointing 
(see Table HJ. 
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o GTX275 
A GTX280 
□ Tesla C1060 



1440 2880 4320 5760 7200 8640 10080 
matrix size 



Figure 8: Speedup of GPUJACH1 vs. the 4-task parallel block-oriented algorithm (with 
accum. of V~ T ). 
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Figure 9: Speedup of GPUJACH1 vs. the 8-task parallel block-oriented algorithm (with 
accum. of V~ T ). 



A plausible explanation would be that the scheduling algorithm of NVIDIA GPUs 
works reasonably well, and that it puts as much blocks on a SM in a given time as 
possible. Also, the blocks completes execution when its slowest pair of warps does. If 
you have a pair of warps performing a trigonometric transformation, and another pair 
performing a hyperbolic one in the same block, the hyperbolic one will slow down the 
entire block. As a part of a block cannot be replaced on SM by a part of another block 
(only entire blocks are changeable) , it seems reasonable to have only one pair of warps in a 
block, which turned out to be true. Therefore, having more pairs of warps per block could 
be viable if the computation of transformations, which causes imbalance of execution time, 
is isolated into a separate, lightweight kernel. 



computational 


global memory throughput [GB/s] 


instruction 


achieved 


register 


kernel 


read 


write overall 


throughput 


occupancy 


ratio 


Jacobi_Step 


50.0280 


23.3173 73.3453 


0.285392 


0.3125 


0.9375 


Precompute_Data 


84.3262 


0.3294 84.6556 


0.219198 


0.5000 


0.5000 



Table 4: Throughput and occupancy of HSVD on TeslaC1060 for matrices of order 4096 
with 2048 positive/negative signs in J (with accumulation of V~ T ). 
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Figure 10: Percentages of time. 



number of warps 


2 


4 


6 8 


time 


394 


429 


589 544 



Table 5: Speed of modified version of HSVD (without diagonal packing) on GTX275 for 
matrices of order 4096 with 2048 positive signs in J (with accum. of V~ T ) and different 
number of warps per block. 

6 Conclusions and future work 

GPUJACHl is fast and usable in its own right. However, it can be incorporated into 
the hybrid CPU-GPU parallel Jacobi framework with ease. In [2TJ, besides the block- 
oriented, the parallel full-block one-sided hyperbolic Jacobi algorithm is developed. The 
full-block algorithm diagonalizes its pivot block, while block-oriented only annihilates the 
off-diagonal elements of a block once. Instead of sequential diagonalization of a block in 
each MPI process, GPUJACHl can be plugged in as a direct replacement. The speedup 
should be the same as the speedup of GPUJACHl vs. the sequential algorithm, if each 
MPI process has access to its own GPU unit and block sizes are large enough. 

On GT200 chips there is no cache for the global memory, and the shared memory is 
small. Thus, it is difficult to avoid the slow BLAS 1 operations, and no easy blocking 
is possible. On the other hand, Fermi chips have multiple levels of cache, and a larger 
shared memory. That gives opportunity to employ BLAS 3 operations for factorization 
in GPU and the shared- memory block diagonalization (the full-block way) . 

Our future work also includes the symmetric indefinite factorization for GPU, which 
is the key missing part to have a complete CPU-based symmetric indefinite Jacobi-type 
eigensolver. 

A Appendix 

There are many parallel strategies worth to study. A bunch of them are described in 
|161 nUl ITT] . An obvious choice of strategy is to take as many as possible, i.e., [n/2\ 
independent pivot pairs for annihilation in each step, organized in n — 1 steps in a sweep 
for even n and n steps for odd n. For many strategies which satisfy this property, the 
proof of the convergence is not known. Therefore, we loosen the requirement on the 
maximal number of independent pivot pairs, but still firmly require the convergence of 
the algorithm. 

The modulus strategy (in two different shifted forms) was described in [TB] and (5]. 
The name modulus strategy for the first time appears in [TT], where on small examples, 
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its equivalence to the row cyclic strategy is illustrated. 

Here we prove that our form of modulus strategy is weakly equivalent to row cyclic 
strategy, and thus convergent. The convergence of the row cyclic strategy can be found 
in [551 Theorem 2.3]. 

Let us introduce a notation, taken from jj. In a pivot strategy, let be the 

index at which the pair (i,j) occurs. Adjacent pivot pairs and (p, q) can swap their 
positions if {i, j}f~){p, q} = 0. This transposition of pairs is called admissible transposition. 
Two pivot strategies O and O' are equivalent if one can be transformed to the other by a 
finite number of admissible transpositions. 

Two pivot strategies O and O' are shift equivalent, if one is obtained form the other 
by cyclic shift of the pivot pairs, i.e., if the position of each pair in strategy O is I(i,j), 
its new position in strategy O' is 

I'ihj) : = (I(hj) + c ) mod n v 

where n p — n(n — l)/2 and c G Z. 

Two pivot strategies O and O' are weakly equivalent, if there exist strategies Oj, 
1 < i < r, for some r > 1, such that in the sequence of strategies O, 0%, . . . , O r , O' 
each two adjacent terms arc either equivalent or shift equivalent. Obviously, proof of 
convergence for one strategy in the sequence is sufficient to prove that all strategies in the 
sequence are convergent. 

The convergence of an algorithm is usually proved for the two-sided algorithm. Since 
the one-sided transformations on a factor of a matrix are in theory (need not be numeri- 
cally) the same as the two-sided algorithm, we immediately have the proof of convergence 
for the one-sided algorithm. 

First we prove that the antidiagonal strategy is equivalent to row cyclic strategy, and 
second that the modulus strategy is weakly equivalent to antidiagonal strategy. 

Antidiagonal strategy consists of the sequence of steps shown in Fig. [TTJ (left). The 
steps of the modulus strategy are shown in Fig. [TT] (right). 




Figure 11: Antidiagonal strategy (left), and the modulus strategy (right). The order of 
steps are labeled on each antidiagonal. 

Table [5] shows the order of annihilation in the antidiagonal strategy. 

There is a minor difference between odd and even n in the step n — 1 in Table [6] i.e., 

k _ f^Y 1 , n odd, p _ik + 2, n odd, 
1^, ii even, lfe + 1, n even. 

By admissible transpositions, pivot pairs from the first column (pairs in Table 

[6] could be written before all pairs (2, • ), (3, •),..., (fc — 1, • ), since pivot indices (1, j), 
j = 5, . . . ,n are disjoint with indices of these columns (the first index in is always 
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step 



simultaneously annihilated pivot pairs 



1 


(1,2) 






2 


(1,3) 






3 


(1,4) 


(2,3) 




4 


(1,5) 


(2,4) 




5 


(1,6) 


(2,5) 


(3,4) 


- 1 


(l,n) 


(2,n-l) 


(3,n-2) 


n 




(2,n) 


(3,n-l) 



(M) 

J(fc + l,f), nodd 
1 (fc, ^ + 1), n even 

2n-2 (n-2,ra-l) (ra-2,ra) 

2n — 3 (n — 1, n) 



Table 6: Order of annihilation in the antidiagonal strategy. 



smaller than the first index in other columns, and the second is always greater than the 
second index in other columns). The rest of the proof, for indices in the second, third, 
and other columns follows by induction over column index. This completes the proof of 
equivalence of the antidiagonal and the row cyclic strategy. 

The second step in the proof is to show that antidiagonal and the modulus strategy 
are weakly equivalent. 

According to Fig. [TT] modulus strategy consists of either one or two steps of the an- 
tidiagonal strategy. Note that the antidiagonal strategy, which consists of steps 

O = (1, 2, 3, . . . , n - 2, n - 1, n, n + 1, n + 2, . . . , 2n - 4, In - 3), 

is shift equivalent to strategy which consists of steps 

O x = (n - 1, n, n + 1, n + 2, . . . , 2n - 4, 2n - 3, 1, 2, 3, . . . , n - 2). 

The pivot indices in step 1 (pair (1,2)) are disjoint with the indices contained in steps 
n + 2, . . . , In — 3 since the first index is at least 3, and the second is at least I + 1 > 2. 
Thus, cyclic strategy 0\ is equivalent to 

2 = (n - 1, n, n + 1, 1, n + 2, . . . , 2n - 4, 2n - 3, 2, 3, . . . , n - 2). 

The similar reasoning holds for step 2, with indices disjoint with indices in steps n + 
3, . . . , In — 3, indices from step 3 are disjoint with indices in steps n + 4, . . . , In — 3, and 
so on. 

The final sequence of steps is 

O' = (n- l,n,n + l,l,n + 2,2,...,2n- 3,n- 2), 

which is in fact modulus strategy, since step n — 1 of the antidiagonal strategy is the first 
step of the modulus strategy, step n is the second step of the modulus strategy, the third 
step of the modulus strategy consists of the steps n + 1,1 from the antidiagonal strategy, 
and so on. 

This proves that the modulus strategy is weakly equivalent to row cyclic strategy. 
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